Visual Performance in Multidimensional Data Characterisation with Scatterplots and Parallel Coordinates
Abstract
We present a study on the visual assessment of relative data point distances in parallel coordinate systems and in scatterplots in Cartesian coordinate systems. Specifically, we assess the impact of coordinate system type, dimension, and relative point distance deviation. We performed an online pilot experiment with 100 participants using Amazon’s Mechanical Turk. The experiment design and methodology are presented in detail, and the results indicate that there may indeed be a difference in human performance when visually assessing distances in the considered coordinate systems. We argue that further investigations are needed to draw stronger conclusions. These should consider including other factors in the experiment design, such as the relative angle between data points, which is expected to have a significant impact on the outcomes.
Introduction
Visual assessment of graphical perception has been of research interest for several decades, with pioneering work by Cleveland et al. [5] on the fundamentals of using graphical elements to quantify visual information. Since then, a large body of work has covered many aspects of visual perception in visualisation and graphics [21]. Generally, we distinguish between works that focus on the assessment of low-level perceptual attributes in the spirit of Cleveland’s work [15] and works that compare high-level, complex visualisations [1, 19].
Scatterplots in Cartesian coordinate systems have been around for a long time and are widely adopted to visually represent data points. Their limitations in representing multivariate data have sparked the development of new techniques, such as parallel coordinates [13]. Parallel coordinates have since become a standard tool for the visualisation of multivariate data by representing N-dimensional points as polygonal lines crossing N parallel axes.
Similar to Cartesian coordinates, this layout allows one to read off data values at different levels of dimensionality: individual axes represent one-dimensional information, pairs of axes represent two-dimensional projections, and retrieving values from multiple axes provides enough information to reconstruct multidimensional data. While some studies suggest that Cartesian coordinates outperform parallel coordinates in conveying two-dimensional linear correlations, others have shown that parallel coordinates may provide a very effective interface for tracing the values of a single data point across multiple dimensions.
In this study, we extend this line of research by assessing the visual performance of novice users in value retrieval and comparison/characterisation tasks for Cartesian Coordinates (CC) and Parallel Coordinates (PC). Specifically, we investigate human performance in estimating relative distances between data points in CC and PC in various dimensions. We hypothesise that the performance of PC relative to CC increases with the dimension of the coordinate system. To this end, we conducted an online psychophysical experiment using Amazon’s Mechanical Turk. We found that there may indeed be a difference in human performance when visually assessing distances in the considered coordinate systems. However, we argue that further investigations are needed to draw stronger conclusions with regard to our hypothesis. We consider this experiment to be a pilot for a larger body of work that investigates low-level perceptual attributes in data visualisation in order to represent data properties effectively.
The remainder of the paper is organised as follows. In the following section, we briefly review further related work. We then introduce the experiment design and methodology in detail, followed by an analysis and discussion of the experiment results. We finish with concluding remarks.
Related Work
A single scatterplot in a CC system is typically used to visualise points in two dimensions. For N-dimensional data, multiple two-dimensional scatterplots can be used to convey the full dataset in N − 1 (typically axis-aligned) two-dimensional subspaces. These can be arranged in various ways [4, 18, 20], with the scatterplot matrix (SPLOM) [8] being the most common approach.
Parallel Coordinates [13] have become a standard technique for the visualisation of multidimensional data. Since the first publication [12], many techniques have been proposed to address the most common challenges in traditional PC (see [10] for a recent overview), typically by modifying either the layout of the axes or the appearance of the lines. While most of these were evaluated in comparison to the traditional, line-based PC plot [14], little is known about the effectiveness of traditional PC in conveying simple properties of the underlying multidimensional data. Two independent studies [6, 17] found that scatterplots outperform PC in conveying linear correlation. However, both studies investigated two-dimensional data. Holten & van Wijk [11] further found that participants in their study identified the number of clusters faster and more accurately with a set of scatterplots. While clusters are a multidimensional property, their number does not change once identified in any of the subdimensions. In contrast, the relative distance of points as investigated in this work can only be judged accurately after looking at all dimensions.
Kuang et al. [16] compared the performance of a value retrieval task in PC with three variations of scatterplots. Value retrieval is a subtask of many other tasks [2], including the estimation of relative distances. The results of their study show that PC outperform scatterplots in CC for sparse data. This is expected, as PC support the task naturally by resolving the correspondence of point coordinates over multiple dimensions visually, i.e.
by connecting them with a line. In order not to confound our results with the value retrieval task, we use colour to resolve the correspondence of points between multiple dimensions.
Psychophysical Experiment
Experiment Design
We designed the experiment with the main goal of investigating the relative performance of PC and CC for multidimensional data characterisation. We considered 2-dimensional, 3-dimensional, and 4-dimensional coordinate systems. For the purpose of assessment, we presented three data points, labelled A, B, and C, in each coordinate system. Observers were instructed to identify which of the points B and C is closer to point A. To add variability to the experiment, we did not control the overall distances and angles between the data points; we controlled only the relative distance deviation between A to B and A to C. Given the above, we had three independent variables: coordinate system type T, coordinate system dimension D, and point distance deviation δ. The latter is defined as the absolute difference of the respective distances of A to B and A to C. The details of these independent variables (IVs) are summarised in Table 1. We used a full factorial design of these IVs, resulting in 2 × 3 × 11 = 66 stimuli.
Table 1: Summary of independent variables.
Variable name | Variable values | #
Coordinate system type T | Parallel / Cartesian | 2
Coordinate system dimension D | 2 / 3 / 4 | 3
Point distance deviation δ | 0 / 0.05 / 0.1 / 0.15 / 0.2 / 0.25 / 0.3 / 0.35 / 0.4 / 0.45 / 0.5 | 11
Stimuli Creation
We created the stimuli using R. For each plot, three data samples were presented in the respective coordinate system type and dimension. While the IVs summarised in Table 1 were fully controlled, the overall distances and angles between data samples were randomly computed. Example stimuli are presented for 2D, 3D, and 4D coordinate systems in Fig. 1, Fig. 2, and Fig. 3, respectively.
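The design and stimulus constraints described above can be illustrated with a short sketch. It is not the authors' R code: it is a minimal Python illustration that enumerates the 2 × 3 × 11 factorial design and draws three points A, B, and C whose distances to A differ by exactly δ. The Euclidean metric, the unit hypercube as data range, the range of the shorter distance, the rejection sampling scheme, and all function names are assumptions made for illustration only; the source does not specify them.

```python
import itertools
import math
import random

# Levels of the three independent variables from Table 1.
SYSTEM_TYPES = ("parallel", "cartesian")
DIMENSIONS = (2, 3, 4)
DEVIATIONS = tuple(round(0.05 * k, 2) for k in range(11))  # 0.0, 0.05, ..., 0.5


def factorial_design():
    """Enumerate every (T, D, delta) combination of the full factorial design."""
    return list(itertools.product(SYSTEM_TYPES, DIMENSIONS, DEVIATIONS))


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere in `dim` dimensions."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:
            return [x / norm for x in v]


def sample_points(dim, delta, rng, max_tries=100_000):
    """Sample A, B, C in [0, 1]^dim with |d(A, B) - d(A, C)| equal to delta.

    Assumptions (not from the source): Euclidean distance, unit hypercube,
    and a shorter distance drawn from [0.1, 0.4].
    """
    for _ in range(max_tries):
        a = [rng.random() for _ in range(dim)]
        near = rng.uniform(0.1, 0.4)  # assumed range for the shorter distance
        far = near + delta
        u = random_unit_vector(dim, rng)
        v = random_unit_vector(dim, rng)
        p = [ai + near * ui for ai, ui in zip(a, u)]
        q = [ai + far * vi for ai, vi in zip(a, v)]
        # Reject configurations that leave the assumed data range.
        if all(0.0 <= x <= 1.0 for x in p + q):
            # Randomly decide whether B or C is the closer point.
            return (a, p, q) if rng.random() < 0.5 else (a, q, p)
    raise RuntimeError("no valid configuration found")
```

Calling `factorial_design()` yields the 66 conditions, and `sample_points(3, 0.2, random.Random(0))` returns one candidate triple for a 3D stimulus with δ = 0.2. Because only δ is fixed, the overall distances and the angle between the two directions vary freely from stimulus to stimulus, in line with the description above.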
While the design of PC lends itself to visual representations in coordinate systems of three or more dimensions, CC are most suitably presented in two dimensions. We therefore chose to present three and four dimensions in our experiment as a series of 2D CC systems. Specifically, two 2D CC systems are needed for a 3D representation and three 2D CC systems are needed for a 4D representation. The axes of all coordinate systems are referred to as Xi, i ∈ {1,2,3,4}. For the PC, we chose the most intuitive arrangement of these axes by simply sorting them in increasing order from left to right. The strategy for arranging the CC axes was less obvious. Several such strategies are discussed in Kuang et al. [16] and we decided on an arrangement that we consider to be fairest (a)